Using projections and correlations to 
approximate probability distributions 



Dean KarlenQ 
Ottawa- Carleton Institute for Physics 
Department of Physics, Carleton University 
Ottawa, Canada KIS 5B6 
(May 7, 1998) 



Abstract 



A method to approximate continuous multi-dimensional probability density 
functions (PDFs) using their projections and correlations is described. The 
method is particularly useful for event classification when estimates of sys- 
tematic uncertainties are required and for the application of an unbinned 
maximum likelihood analysis when an analytic model is not available. A sim- 
ple goodness of fit test of the approximation can be used, and simulated event 
samples that follow the approximate PDFs can be efficiently generated. The 
source code for a FORTRAN-77 implementation of this method is available. 
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I. INTRODUCTION 



Visualization of mult i- dimensional distributions is often performed by examining single 
variable distributions (that is, one-dimensional projections) and linear correlation coefficients 
amongst the variables. This can be adequate when the sample size is small, the distribution 
consists of essentially uncorrelated variables, or when the correlations between the variables 
is approximately linear. This paper describes a method to approximate multi-dimensional 
distributions in this manner and its applications in data analysis. 

The method described in this paper, the Projection and Correlation Approximation 
(PCA), is particularly useful in analyses which make use of either simulated or control event 
samples. In particle physics, for example, such samples are used to develop algorithms 
that efficiently select events of one type while preferentially rejecting events of other types. 
The algorithm can be as simple as a set of criteria on quantities directly measured in the 
experiment or as complex as an application of an artificial neural network on a large 
number of observables. The more complex algorithm may result in higher efficiency and 
purity, but the determination of systematic errors can be difficult to estimate. The PCA 
method can be used to define a sophisticated selection algorithm with good efficiency and 
purity, in a way that systematic uncertainties can be reliably estimated. 

Another application of the PCA method is in parameter estimation from a data set 
using a maximum likelihood technique. If the information available is in the form of simu- 
lated event samples, it can be difficult to apply an unbinned maximum likelihood method, 
because it requires a functional representation of the multidimensional probability density 
function (PDF). The PCA method can be used to approximate the PDFs required for the 
maximum likelihood method. A simple goodness of fit test is available to determine if the 
approximation is valid. 

To verify the statistical uncertainty of an analysis, it can be useful to create a large en- 
semble of simulated samples, each sample equivalent in size to the data set being analyzed. 
In cases where this is not practical because of limited computing resources, the approxima- 
tion developed in the PCA method can be used, as it is in a form that leads to an efficient 
method for event generation. 

In the following sections, the projection and correlation approximation will be described 
along with its applications. An example data analysis using the PCA method is shown. 

II. PROJECTION AND CORRELATION APPROXIMATION 

Consider an arbitrary probability density function 'P(x) of n variables, Xj. The basis 
for the approximation of this PDF using the PCA approach is the n-dimensional Gaussian 
distribution, centered at the origin, which is described by an n x n covariance matrix, V , by 

G(y) = (27r)-"/2|y|-V2exp(-|y^y-iy) (1) 

where |V^| is the determinant of V . The variables x are not, in general, Gaussian distributed 
so this formula would be a poor approximation of the PDF, if used directly. Instead, the PCA 
method uses parameter transformations, yi{xi), such that the individual distributions for i/i 
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are Gaussian and, as a result, the n-dimensional distribution for y may be well approximated 
by Eq. (i. 

The monotonia function y{x) that transforms a variable x, having a distribution function 
p{x), to the variable y, which follows a Gaussian distribution of mean and variance 1, is 

y{x) = V2eTr^ {2F{x) - 1) (2) 

where erf~^ is the inverse error function and F{x) is the cumulative distribution of x, 

Fix) = ^^^--P^'''^^''' (3) 

The resulting n-dimensional distribution for y will not, in general, be an n-dimensional 
Gaussian distribution. It is only guaranteed that the projections of this distribution onto 
each yi axis is Gaussian. In the PGA approximation, however, the probability density 
function of y is assumed to be Gaussian. Although not exact, this can represent a good 
approximation of a mult i- dimensional distribution in which the correlation of the variables 
is relatively simple. 

Written in terms of the projections, Pi{xi), the approximation of 'P(x) using the PGA 
method is, 

n 

P(x) = |Vr^/^exp {-ly^iV-' - I)y) Y[Mx.) (4) 

i=l 

where V is the covariance matrix for y and / is the identity matrix. To approximate the 
projections, Pi{xi), needed in Eqs. (||) and @), binned frequency distributions (histograms) 
of Xi can be used. 

The projection and correlation approximation is exact for distributions with uncorrelated 
variables, in which case V = L It is also exact for a Gaussian distribution modified by mono- 
tonic one-dimensional variable transformations for any number of variables; or equivalently, 
multiplication by a non-negative separable function. 

A large variety of distributions can be well approximated by the PGA method. However, 
there are distributions for which this will not be true. For the PGA method to yield a 
good approximation in two-dimensions, the correlation between the two variables must be 
the same sign for all regions. If the space can be split into regions, inside of which the 
correlation has everywhere the same sign, then the PGA method can be used on each region 
separately. To determine if a distribution is well approximated by the PGA method, a 
goodness of fit test can be applied, as described in the next section. 

The generation of simulated event samples that follow the PGA PDF is straightforward 
and efficient. Events are generated in y space, according to Eq. (|l|), and then are transformed 
to the X space. The procedure involves no rejection of trial events, and is therefore fully 
efficient. 



III. GOODNESS OF FIT TEST 

Some applications of the PGA method do not require that the PDFs be particularly well 
approximated. For example, to estimate the purity and efficiency of event classification, 
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it is only necessary that the simulated or control samples are good representations of the 
data. Other applications, such as its use in maximum likelihood analyses, require the PDF 
to be a good approximation, in order that the estimators are unbiased and that the esti- 
mated statistical uncertainties are vahd. Therefore it may be important to check that the 
approximate PDF derived with the PCA method is adequate for a given problem. 

In general, when approximating a multidimensional distribution from a sample of events, 
it can be difficult to derive a goodness of fit statistic, like a statistic. This is because 
the required multidimensional binning can reduce the average number of events per bin to 
a very small number, much less than 1. 

When the PCA method is used, however, it is easy to form a statistic to test if a sample 
of events follows the PDF, without slicing the variable space into thousands of bins. The 
PCA method already ensures that the projections of the approximate PDF will match that 
of the event sample. A statistic that is sensitive to the correlation amongst the variables is 
most easily defined in the space of transformed variables, y, where the approximate PDF is 
an n-dimensional Gaussian. For each event the value X"^ is calculated. 



and if the events follow the PDF, the X"^ values will follow a distribution with n degrees 
of freedom, where n is the dimension of the Gaussian. A probability weight, w, can therefore 
be formed. 



which will be uniformly distributed between and 1, if the events follow the PDF. The 
procedure can be thought of in terms of dividing the n-dimensional y space into layers 
centered about the origin (and whose boundaries are at constant probability in y space) and 
checking that the right number of events appears in each layer. The goodness of fit test for 
the PCA distribution is therefore reduced to a test that the w distribution is uniform. 

When the goodness of fit test shows that the event sample is not well described by the 
projection and correlation approximation, further steps may be necessary before the PCA 
method can be applied to an analysis. To identify correlations which are poorly described, 
the goodness of fit test can be repeated for each pair of variables. If the test fails for a pair 
of variables, it may be possible to improve the approximation by modifying the choice of 
variables used in the analysis, or by treating different regions of variable space by separate 
approximations . 



Given two categories of events that follow the PDFs ■Pi(x) and 7^2 (x), the optimal event 
classification scheme to define a sample enriched in type 1 events, selects events having the 

largest values for the ratio of probabilities, R = Pi (x) /P2 (x) . Using simulated or control 
samples, the PCA method can be used to define the approximate PDFs -Pi(x) and -P2(x), 
and in order to define a quantity limited to the range [0, 1], it is useful to define a likelihood 
ratio 



(5) 




(6) 



IV. EVENT CLASSIFICATION 
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C= (7) 

Pl(x)+P2(x)- ^> 

With only two categories of events, it is irrelevant if the PDFs Pi and P2 are renormalized 
to their relative abundances in the data set. The generalization to more than two categories 
of events requires that the PDFs Pi be renormalized to their abundances. In either case, 
each event is classified on the basis of the whether or not the value of C for that event is 
larger than some critical value. 

Systematic errors in the estimated purity and efficiency of event classification can result if 
the simulated (or control) samples do not follow the true PDFs. To estimate the systematic 
uncertainties of the selection, the projections and covariance matrices used to define the 
PCA PDFs can be varied over suitable ranges. 



V. EXAMPLE APPLICATION 

In this section the PCA method and its applications are demonstrated with simple anal- 
yses of simulated event samples. Two samples, one labeled signal and the other background, 
are generated with, xi G (0, 10) and X2 G (0, 1), according to the distributions, 

dir r)- (xi-ai)2 + a2 

as[Xi, X2) 



(a3(xi - 04(1 + 05X2))^ + a6){{x2 - aj)^ + as) 

where the vectors of constants are given by a= (7, 2, 6, 4, 0.8, 40, 0.6, 2) and b= (0.1, 3, 0.1). 
These samples of 4000 events each correspond to simulated or control samples used in the 
analysis of a data set. In what follows it is assumed that the analytic forms of the parent 
distributions, Eq. (|), are unknown. 

The signal and background control samples are shown in Fig. [| and Fig. ^ respectively. 
A third sample, considered to be data and shown in Fig. |^, is formed by mixing a further 
240 events generated according to ds and 160 events generated according to db. 

The transformation given in Eq. (|^) is applied to the signal control sample, which results 
in the distribution shown in Fig. ^. To define the transformation, the projections shown in 
Fig. are used, 40 bins for each dimension. The projections of the transformed distribution 
are Gaussian, and the correlation coefficient is found to be 0.40. The goodness of fit test. 



described in section |ITI| , checks the assumption that the transformed distribution is a 2- 
dimensional Gaussian. The resulting wlX"^) distribution from this test is relatively uniform, 
as shown in Fig. |^. 

A separate transformation of the background control sample gives the distribution shown 
in Fig. ^, which has a correlation coefficient of 0.03. Note that a small linear correlation 
coefficient does not necessarily imply that the variables are uncorrelated. In this case the 
2-dimensional distribution is well described by 2-dimensional Gaussian, as shown in Fig. ^ 

Since the PCA method gives a relatively good approximation of the signal and back- 
ground probability distributions, an efficient event classification scheme can be developed. 
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as described in section Care needs to be taken, however, so that the estimation of the 
overall efficiency and purity of the selection is not biased. In this example, the approximate 
signal PDF is defined by 81 parameters (two projections of 40 bins, and one correlation 
coefficient) derived from the 4000 events in the signal control sample. These parameters will 
be sensitive to the statistical fluctuations in the control sample, and thus if the same control 
sample is used to optimize the selection and estimate the efficiency and purity, the estimates 
may be biased. To reduce this bias, additional samples are generated with the method de- 
scribed at the end of section |I|. These samples are used to define the 81 parameters, and the 
event classification scheme is applied to the original control samples to estimate the purity 
and efficiency. In this example data analysis, the bias is small. When the original control 
sample is used to define the 81 parameters, the optimal signal to noise is achieved with an 
efficiency of 0.880 and purity of 0.726. When the PCA generated samples are used instead, 
the selection efficiency is reduced to 0.873, for the same purity. 

When the classification scheme is applied to the data sample, 261 events are classified 
as signal events. Given the efficiency and purity quoted above, the number of signal events 
in the sample is estimated to be 217 ± 19. 

The number of signal events in the data sample can be more accurately determined by 
using a maximum likelihood analysis. The likelihood function is defined by 

400 

where the product runs over the 400 data events, fs is the fraction of events attributed to 
signal, and Ps and Pb are the PCA approximated PDFs, defined by Eq. (H). The signal 
fraction, estimated by maximizing the likelihood, is 0.617 ± 0.040, a relative uncertainty 
of 6.4% compared to the 8.5% uncertainty from the counting method. To check that the 
data sample is well described by the model used to define the hkelihood function, Eq. (H), 
the ratio of probabilities, Eq. (|^, is shown in Fig. |^, and compared to a mixture of PCA 
generated signal and background samples. 

VI. FORTRAN IMPLEMENTATION 

The source code for a FORTRAN- 77 implementation of the methods described in this 
paper is available from the author. The program was originally developed for use in an 
analysis of data from OPAL, a particle physics experiment located at CERN, and makes 
use of the CERNLIB library . An alternate version is also available, in which the calls to 
CERNLIB routines are replaced by calls to equivalent routines from NETLIB 0]. 
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FIGURES 




FIG. 1. The points represent the sample of 4000 events generated according to the function 
dg in Eq. (P), which are used as a control sample for the signal distribution. Contours of ds are 
shown to aid the eye. The two projections of the distribution are used by the PCA method to 
approximate the signal PDF. 
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FIG. 2. The points represent the sample of 4000 events generated according to the function db 
in Eq. (P), which are used as a control sample for the background distribution. Contours of db 
are shown to aid the eye. The two projections of the distribution are used by the PCA method to 
approximate the background PDF. 
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FIG. 3. The points represent the data sample of 400 events consistmg of 240 events generated 
according to the function dg and 160 generated according to db in Eq. @. 
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FIG. 4. The points show the distribution of the 4000 signal events after being transformed 
according to Eq. (|2|). The projections are now Gaussian distributions, centered at with width 1, 
and the overah distribution appears to follow a 2-dimensional Gaussian. The correlation coefficient 
is 0.40. 
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FIG. 5. The upper and lower histograms show the results of the goodness of fit test applied to 
the signal and background control samples. The values are 31 and 14 for 19 degrees of freedom, 
respectively. 
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FIG. 6. The points show the distribution of the 4000 background events after being transformed 
according to Eq. @. The correlation coefficient is 0.03, and the two variables appear to be 
uncor related. 
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FIG. 7. A check is made that the data sample is consistent with the model used in the maximum 
likelihood analysis. The distribution of the probability ratio, Eq. is shown for the data events 
and compared to the expected distribution, as given by a mixture of PCA generated signal and 
background samples. The agreement is good, the value for is 36 for 35 degrees of freedom. 
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